Processing Data-Intensive Workflows in the Cloud
Author
Abstract
In recent years, large-scale data analysis has become critical to the success of modern enterprises. Meanwhile, with the emergence of cloud computing, companies are attracted to moving their data analytics tasks to the cloud because of its flexible, on-demand resource usage and pay-as-you-go pricing model. MapReduce has been widely recognized as an important tool for performing large-scale data analysis in the cloud. It provides a simple and fault-tolerant framework for users to process data-intensive analytics tasks in parallel across different physical machines. In this report, we survey alternative implementations of MapReduce, contrasting batch-oriented and pipelined execution models and studying how these models affect response time, completion time, and robustness. Next, we present three optimization strategies for MapReduce-style workflows: (1) scan sharing across MapReduce programs, (2) workflow optimizations aimed at reducing intermediate data, and (3) scheduling policies that map workflow tasks to different machines in order to minimize completion time and monetary cost. We conclude with a brief comparison of these optimization strategies, and discuss their pros and cons as well as the performance implications of using more than one optimization strategy at a time.
University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-12-07.
Comments: University of Pennsylvania Department of Computer and Information Science Technical Report No. MSCIS-12-08. WPE-II Written Report.
This technical report is available at ScholarlyCommons: http://repository.upenn.edu/cis_reports/970
Similar resources
A Clustering Approach to Scientific Workflow Scheduling on the Cloud with Deadline and Cost Constraints
One of the main features of High Throughput Computing systems is the availability of high power processing resources. Cloud Computing systems can offer these features through concepts like Pay-Per-Use and Quality of Service (QoS) over the Internet. Many applications in Cloud computing are represented by workflows. Quality of Service is one of the most important challenges in the context of sche...
WaaS: Workflow-as-a-Service for the Cloud with Scheduling of Continuous and Data-Intensive Workflows
Data-intensive and long-lasting applications running in the form of workflows are being increasingly dispatched to cloud computing systems. Current scheduling approaches for graphs of dependencies fail to deliver high resource efficiency while keeping computation costs low, especially for continuous data processing workflows, where the scheduler does not perform any reasoning about the impact n...
Planning and Scheduling Data Processing Workflows in the Cloud with Quality-of-Data Constraints
Data-intensive and long-lasting applications running in the form of workflows are being increasingly more dispatched to cloud computing systems. Current scheduling approaches for graphs of dependencies fail to deliver high resource efficiency while keeping computation costs low, especially for continuous data processing workflows, where the scheduler does not perform any reasoning about the imp...
Opportunities and Challenges for Running Scientific Workflows on the Cloud
Cloud computing is gaining tremendous momentum in both academia and industry. The application of Cloud computing, however, has mostly focused on Web applications and business applications, while the recognition of using Cloud computing to support large-scale workflows, especially data-intensive scientific workflows on the Cloud, is still largely overlooked. We coin the term “Cloud Workflow”, to r...
Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming
The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations because the rapid development in the use of information technology in general and network technology in particular, has led to the trend of many organizations to make their applications available for use via electronic platforms hos...
Publication date: 2014